Cp bench #5
Draft
sudhakarsingh27 wants to merge 3 commits into
Move benchmarking infrastructure for CP attention onto a dedicated branch so it persists outside of stash. The core test suite (test_attention_with_cp.py) stays focused on correctness; this branch layers benchmark/profile/stress configs and a cross-backend consistency check on top.

run_attention_with_cp.py changes (worker side):
- thd_seqlen_pattern arg supports max/half/linear/alternating/random and explicit comma-separated lengths, so benchmark configs can pin a specific variable-length workload instead of randomizing per-run.
- benchmark arg drives a 10-warmup + N-iter timing loop wrapped in cudaProfilerStart/Stop and prints ms/iter for nsys/ncu workflows.
- torch.manual_seed(1234) for reproducibility across runs.
- CP_CROSS_BACKEND_SAVE_DIR env saves per-rank inputs/outputs as .pt for the cross-backend consistency test to compare without re-running.
- Soft import from benchmark_cp so the worker can resolve names like cp_thd_0, bench_8k, bariamis_8k, rl16k without test_attention_with_cp.py needing to know about them.

benchmark_cp.py (new):
- Stress configs (cp_thd_0..3, cp_thd_swa_0..3): higher batch/longer seqlen than the core suite.
- Llama3-8b-shaped configs (bench_8k/16k/32k).
- Variable-length training-workload configs (rl16k, bucket32k/64k/128k, mixed32k, outlier64k) with per-config thd_seqlen_pattern.
- Worker-only configs (bariamis_*, bench_84992/86016) for manual invocation against the AG spike investigation log shapes.
- test_cp_thd_cross_backend_consistency: runs each backend (p2p/all_gather/a2a) on the same input, saves outputs via CP_CROSS_BACKEND_SAVE_DIR, and asserts pairwise agreement within atol=0.1.

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>
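The pairwise check described for test_cp_thd_cross_backend_consistency can be sketched roughly as below. This is a minimal, dependency-free illustration and not the code in benchmark_cp.py: the helper names max_abs_diff and assert_pairwise_close are hypothetical, plain Python lists stand in for the tensors that would be loaded from the saved .pt files, and the sample values are invented. Only the backend names and atol=0.1 come from the commit message.

```python
from itertools import combinations

ATOL = 0.1  # tolerance quoted in the commit message


def max_abs_diff(a, b):
    """Largest elementwise absolute difference between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("outputs must have the same length")
    return max((abs(x - y) for x, y in zip(a, b)), default=0.0)


def assert_pairwise_close(outputs, atol=ATOL):
    """Compare every pair of backends; raise AssertionError on the first mismatch."""
    for (name_a, out_a), (name_b, out_b) in combinations(sorted(outputs.items()), 2):
        diff = max_abs_diff(out_a, out_b)
        assert diff <= atol, f"{name_a} vs {name_b}: max abs diff {diff} > {atol}"


# Illustrative values only (hypothetical, not measured results).
outputs = {
    "p2p": [0.50, 1.25, -0.75],
    "all_gather": [0.52, 1.20, -0.74],
    "a2a": [0.49, 1.27, -0.80],
}
assert_pairwise_close(outputs)  # passes: every pair differs by less than 0.1
```

Comparing all pairs rather than each backend against a single reference means a disagreement between any two backends is reported by name, at the cost of three comparisons instead of two.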
Add 18 SWA training workload configs (6 real workloads × 3 windows) to benchmark_cp.py for benchmarking sliding-window attention with context parallelism.

Replace the old single-GPU FusedAttn vs FlashAttn benchmark script with a README documenting full benchmark results (full causal + SWA, cp=2/4/8, p2p/all_gather/a2a) and individual config runner usage.

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>
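The 6 × 3 arithmetic behind the 18 configs can be illustrated with a short sketch. It assumes the six workloads are the variable-length configs named in the first commit and the three windows are the 512/1024/2048 sizes mentioned in the third commit. The `<workload>_swa<window>` naming scheme and the swa_config_names helper are hypothetical and may not match the names actually used in benchmark_cp.py.

```python
from itertools import product

# Assumed workload names (from the first commit's variable-length configs)
# and assumed window sizes (from the third commit's SWA{512,1024,2048}).
WORKLOADS = ["rl16k", "bucket32k", "bucket64k", "bucket128k", "mixed32k", "outlier64k"]
WINDOWS = [512, 1024, 2048]


def swa_config_names(workloads=WORKLOADS, windows=WINDOWS):
    """Return one hypothetical config name per (workload, window) pair."""
    return [f"{w}_swa{win}" for w, win in product(workloads, windows)]


names = swa_config_names()
print(len(names))  # 18
print(names[:3])   # ['rl16k_swa512', 'rl16k_swa1024', 'rl16k_swa2048']
```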
Re-ran all 6 real-training configs (full causal + SWA{512,1024,2048}) on a
second 8x H100 node with cuDNN 9.21 / NCCL 2.29.7 and replaced the prior
results tables. cp=2 was re-run serially because 4-wide concurrency on a single
node distorted a2a SWA timings ~2x and triggered intermittent
cudaErrorIllegalInstruction on AG SWA configs.
The original-node bucket128k SWA AG cp>=4 'FAIL' matrix is no longer present
on the new node, but a smaller intermittent-crash failure mode (cp=2 SWA AG
under heavy concurrency) was observed; documented as a known issue with the
serial-run workaround.
Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>